12 research outputs found

    Development Grouping of Synonym Set Thesaurus Vocabulary The Qur’an in English Using Hierarchical Clustering Algorithm

    Get PDF
    Research in the field of text mining to process entries or words from the Qur'an is very beneficial for Muslims. This study aims to establish a set of synonyms for the thesaurus in the words of the Qur'an. This research is used because the source of knowledge about the science of the Qur'an is still lacking. The dataset in this study uses the Corpus Qur'an and English Translation. This research is a research development of an article that has been published, namely "The Development of Al-Qur'an Vocabulary Set Synonyms with WordNet Approach" by Laras Gupitasari. Input from this research system uses nouns from the translation of English words in the Quran. The output of the system produces several groups that have the same level of closeness of meaning displayed, the first group means the word in the group has a close meaning. To produce output, this study uses word grouping with a hierarchical grouping method and calculates distances using common paths, then groups results according to the closeness of meaning from word entries. The evaluation in this study produced an F-Measure value of 76%, F-Measure Value is an evaluation to measure the accuracy of predictions issued by the system.Research in the field of text mining to process entries or words from the Qur'an is very beneficial for Muslims. This study aims to establish a set of synonyms for the thesaurus in the words of the Qur'an. This research is used because the source of knowledge about the science of the Qur'an is still lacking. The dataset in this study uses the Corpus Qur'an and English Translation. This research is a research development of an article that has been published, namely "The Development of Al-Qur'an Vocabulary Set Synonyms with WordNet Approach" by Laras Gupitasari. Input from this research system uses nouns from the translation of English words in the Quran. The output of the system produces several groups that have the same level of closeness of meaning displayed, the first group means the word in the group has a close meaning. To produce output, this study uses word grouping with a hierarchical grouping method and calculates distances using common paths, then groups results according to the closeness of meaning from word entries. The evaluation in this study produced an F-Measure value of 76%, F-Measure Value is an evaluation to measure the accuracy of predictions issued by the system

    Typo handling in searching of Quran verse based on phonetic similarities

    Get PDF
    The Quran search system is a search system that was built to make it easier for Indonesians to find a verse with text by Indonesian pronunciation, this is a solution for users who have difficulty writing or typing Arabic characters. Quran search system with phonetic similarity can make it easier for Indonesian Muslims to find a particular verse.  Lafzi was one of the systems that developed the search, then Lafzi was further developed under the name Lafzi+. The Lafzi+ system can handle searches with typo queries but there are still fewer variations regarding typing error types. In this research Lafzi++, an improvement from previous development to handle typographical error types was carried out by applying typo correction using the autocomplete method to correct incorrect queries and Damerau Levenshtein distance to calculate the edit distance, so that the system can provide query suggestions when a user mistypes a search, either in the form of substitution, insertion, deletion, or transposition. Users can also search easily because they use Latin characters according to pronunciation in Indonesian. Based on the evaluation results it is known that the system can be better developed, this can be seen from the accuracy value in each query that is tested can surpass the accuracy of the previous system, by getting the highest recall of 96.20% and the highest Mean Average Precision (MAP) reaching 90.69%. The Lafzi++ system can improve the previous system

    ANALISIS PENGARUH METODE COMBINE SAMPLING DALAM CHURN PREDICTION UNTUK PERUSAHAAN TELEKOMUNIKASI

    Get PDF
    Churn prediction pada pelanggan telekomunikasi merupakan upaya memprediksi/mengklasifikasi pelanggan jasa telekomunikasi yang berhenti atau berpindah berlangganan dari suatu operator ke operator yang lain. Namun dataset pada kasus churn ini biasanya memiliki kelas yang imbalance dimana jumlah instance suatu kelas (kelas active atau tidak churn atau mayor atau negatif) jauh lebih besar dari jumlah kelas yang lain (kelas churn atau minor atau positif). Akibatnya, kebanyakan classifier cenderung memprediksi kelas mayor dan mengabaikan kelas minor sehingga akurasi kelas minor sangat kecil. Salah satu pendekatan yang dilakukan untuk menangani permasalahan ini adalah dengan memodifikasi distribusi instances dari dataset yang digunakan atau yang lebih dikenal dengan pendekatan sampling-based. Teknik resampling ini meliputi oversampling, under-sampling, dan combine-sampling. Analisis yang dilakukan pada penelitian ini adalah mengetahui bagaimana pengaruh metode combine sampling yang digunakan terhadap akurasi prediksi data churn dengan melakukan penghitungan akurasi model churn prediction yang dinyatakan dalam bentuk lift curve, top decile dan gini coefficient serta f-measure untuk penghitungan akurasi prediksi data sebagai data yang imbalance. Hasil yang didapat dari penelitian menunjukkan bahwa metode combine sampling belum sesuai diterapkan pada data churn, karena cenderung masih menghasilkan nilai top decile yang kecil. Tetapi secara umum metode combine sampling ini mampu meningkatkan akurasi untuk memprediksi data minor. Dengan penerapan metode combine sampling, data churn yang memiliki tingkat imbalance yang besar dapat diklasifikasi tanpa mengorbankan data minor yang menjadi fokus penelitian. Metode combine sampling yang digunakan juga memiliki hasil evaluasi yang berbeda terhadap dataset sebagai data churn dan sebagai dataimbalance

    ANALISIS PENGARUH METODE COMBINE SAMPLING DALAM CHURN PREDICTION UNTUK PERUSAHAAN TELEKOMUNIKASI

    Get PDF
    Churn prediction pada pelanggan telekomunikasi merupakan upaya memprediksi/mengklasifikasi pelanggan jasa telekomunikasi yang berhenti atau berpindah berlangganan dari suatu operator ke operator yang lain. Namun dataset pada kasus churn ini biasanya memiliki kelas yang imbalance dimana jumlah instance suatu kelas (kelas active atau tidak churn atau mayor atau negatif) jauh lebih besar dari jumlah kelas yang lain (kelas churn atau minor atau positif). Akibatnya, kebanyakan classifier cenderung memprediksi kelas mayor dan mengabaikan kelas minor sehingga akurasi kelas minor sangat kecil. Salah satu pendekatan yang dilakukan untuk menangani permasalahan ini adalah dengan memodifikasi distribusi instances dari dataset yang digunakan atau yang lebih dikenal dengan pendekatan sampling-based. Teknik resampling ini meliputi oversampling, under-sampling, dan combine-sampling. Analisis yang dilakukan pada penelitian ini adalah mengetahui bagaimana pengaruh metode combine sampling yang digunakan terhadap akurasi prediksi data churn dengan melakukan penghitungan akurasi model churn prediction yang dinyatakan dalam bentuk lift curve, top decile dan gini coefficient serta f-measure untuk penghitungan akurasi prediksi data sebagai data yang imbalance. Hasil yang didapat dari penelitian menunjukkan bahwa metode combine sampling belum sesuai diterapkan pada data churn, karena cenderung masih menghasilkan nilai top decile yang kecil. Tetapi secara umum metode combine sampling ini mampu meningkatkan akurasi untuk memprediksi data minor. Dengan penerapan metode combine sampling, data churn yang memiliki tingkat imbalance yang besar dapat diklasifikasi tanpa mengorbankan data minor yang menjadi fokus penelitian. Metode combine sampling yang digunakan juga memiliki hasil evaluasi yang berbeda terhadap dataset sebagai data churn dan sebagai dataimbalance

    ANALISIS PENGARUH METODE COMBINE SAMPLING DALAM CHURN PREDICTION UNTUK PERUSAHAAN TELEKOMUNIKASI

    Get PDF
    Churn prediction pada pelanggan telekomunikasi merupakan upaya memprediksi/mengklasifikasi pelanggan jasa telekomunikasi yang berhenti atau berpindah berlangganan dari suatu operator ke operator yang lain. Namun dataset pada kasus churn ini biasanya memiliki kelas yang imbalance dimana jumlah instance suatu kelas (kelas active atau tidak churn atau mayor atau negatif) jauh lebih besar dari jumlah kelas yang lain (kelas churn atau minor atau positif). Akibatnya, kebanyakan classifier cenderung memprediksi kelas mayor dan mengabaikan kelas minor sehingga akurasi kelas minor sangat kecil. Salah satu pendekatan yang dilakukan untuk menangani permasalahan ini adalah dengan memodifikasi distribusi instances dari dataset yang digunakan atau yang lebih dikenal dengan pendekatan sampling-based. Teknik resampling ini meliputi oversampling, under-sampling, dan combine-sampling. Analisis yang dilakukan pada penelitian ini adalah mengetahui bagaimana pengaruh metode combine sampling yang digunakan terhadap akurasi prediksi data churn dengan melakukan penghitungan akurasi model churn prediction yang dinyatakan dalam bentuk lift curve, top decile dan gini coefficient serta f-measure untuk penghitungan akurasi prediksi data sebagai data yang imbalance. Hasil yang didapat dari penelitian menunjukkan bahwa metode combine sampling belum sesuai diterapkan pada data churn, karena cenderung masih menghasilkan nilai top decile yang kecil. Tetapi secara umum metode combine sampling ini mampu meningkatkan akurasi untuk memprediksi data minor. Dengan penerapan metode combine sampling, data churn yang memiliki tingkat imbalance yang besar dapat diklasifikasi tanpa mengorbankan data minor yang menjadi fokus penelitian. Metode combine sampling yang digunakan juga memiliki hasil evaluasi yang berbeda terhadap dataset sebagai data churn dan sebagai data imbalance

    Decision boundary setting and classifier combination for text classification

    Get PDF
    This thesis presents a promising boundary setting method for solving challenging issues in text classification to produce an effective text classifier. A classifier must identify boundary between classes optimally. However, after the features are selected, the boundary is still unclear with regard to mixed positive and negative documents. A classifier combination method to boost effectiveness of the classification model is also presented. The experiments carried out in the study demonstrate that the proposed classifier is promising

    Pembangunan Monolingual Word Alignment pada Terjemahan Al-Quran Berbahasa Indonesia

    No full text
    For centuries the Quran is present in the midst of civilization and human society consisting of 6236 verses. To measure the semantic similarity between the translations of Al-Quran verses that aim to understand more deeply the meaning related to the verses of the Koran requires a method one of them with monolingual word alignment. Monolingual alignment is a word alignment method that identifies the similarity between words in existing pairs of sentences. In addition to the use of monolingual alignment methods in measuring the similarity of existing words, it also needs a dataset that serves as a collection of a document whose content is the semantic relationship between sets that exist. But the dataset monolingual word alignment for Indonesian in format MSR is still very limited in volume. In this research, some features in the method of monolingual alignment are applied in the development of the dataset monolingual word alignment Indonesian in format MSR, which is textit align identical words, align PFA and align word sequences by generating an F1 value of 86.94 %. For the best results F1 is generated from some combination of alignment features with align identical words and align PFA with F1 result of 88.83 %. Keywordsi Al-Quran, Monolingual Alignment, MS

    PERPADUAN COMBINED SAMPLING DAN ENSEMBLE OF SUPPORT VECTOR MACHINE (ENSVM) UNTUK MENANGANI KASUS CHURN PREDICTION PERUSAHAAN TELEKOMUNIKASI

    No full text
    Churn prediction adalah suatu cara untuk memprediksi pelanggan yang berpotensial untuk churn. Data mining khususnya klasifikasi tampaknya dapat menjadi alternatif solusi dalam membuat model churn prediction yang akurat. Namun hasil klasifikasi menjadi tidak akurat disebabkan karena data churn bersifat imbalance. Kelas data menjadi tidak stabil karena data akan lebih condong ke bagian data yang memiliki komposisi data yang lebih besar. Salah satu cara untuk menangani permasalahan ini adalah dengan memodifikasi dataset yang digunakan atau yang lebih dikenal dengan metode resampling. Teknik resampling ini meliputi over-sampling, under-sampling, dan combined-sampling. Metode Ensemble of SVM (EnSVM) diharapkan dapat meminimalisir kesalahan klasifikasi kelas mayor dan minor yang dihasilkan oleh classifier SVM tunggal. Dalam penelitian ini akan dicoba untuk memadukan combined sampling dan EnSVM untuk churn predicition. Pengujian dilakukan dengan membandingkan hasil klasifikasi CombinedSampling-EnSVM dengan SMOTE-SVM (perpaduan oversamping-SVM) dan pure-SVM. Hasil pengujian menunjukkan bahwa metode CombinedSampling-EnSVM secara umum hanya mampu menghasilkan performansi Gini Index yang lebih baik daripada metode SMOTE-SVM dan tanpa resampling (pure-SVM)

    Building Synonym Sets for English WordNet with Robust Clustering using Links Method

    No full text
    English WordNet is an important synonym set to present the similarity of meanings between words. Synonym Set is built using Oxford Thesaurus which is accessed through lexico.com, which is a part of the lexical database that will be used. After using the extraction process through Oxford Thesaurus it will produce a synonym set with the same meaning between words. The difference between WordNet and ordinary dictionaries is that the word is interconnected with other words. One method employed for this approach is Robust Clustering Using Links method, which is similarity values and synonym sets that have been created to be used to build a lexical database. Therefore the main purpose of the development of the English WordNet is to produce an accurate synonym set using clustering techniques. The evaluation calculation will use the F-measure method and will use the gold standard for the calculation method. With the ROCK method, there is an increase in accuracy output from dataset input. Building the English wordnet is to improve words that can be used to help research and development of other language wordnets with role models using more accurate English wordnets. And the use of ROCK method there is an increase in the accuracy upon results of the development of English wordnet compared to the previous method, which is using hierarchical clustering. The outcome of this study resulted in improved accuracy so that the ROCK method is one of the good methods used in the development of the English wordnet.English WordNet is an important synonym set to present the similarity of meanings between words. Synonym Set is built using Oxford Thesaurus which is accessed through lexico.com, which is a part of the lexical database that will be used. After using the extraction process through Oxford Thesaurus it will produce a synonym set with the same meaning between words. The difference between WordNet and ordinary dictionaries is that the word is interconnected with other words. One method employed for this approach is Robust Clustering Using Links method, which is similarity values and synonym sets that have been created to be used to build a lexical database. Therefore the main purpose of the development of the English WordNet is to produce an accurate synonym set using clustering techniques. The evaluation calculation will use the F-measure method and will use the gold standard for the calculation method. With the ROCK method, there is an increase in accuracy output from dataset input. Building the English wordnet is to improve words that can be used to help research and development of other language wordnets with role models using more accurate English wordnets. And the use of ROCK method there is an increase in the accuracy upon results of the development of English wordnet compared to the previous method, which is using hierarchical clustering. The outcome of this study resulted in improved accuracy so that the ROCK method is one of the good methods used in the development of the English wordnet

    Analysis of Name Entities in Text Using Robust Disambiguation Method

    No full text
    Named entities are proper nouns or objects contained in a text, such as a person's name, country name, and others. Names of persons in some text are often ambiguous, which makes it difficult for ordinary people to find out these same names are the same person or not.  An ambiguity of names also found in hadith, like the name Abdullah in hadith number 86 and 2411, that might be the same person or might be different. Based on this problem, then this study focuses on named entity disambiguation, which considered further semantic and lexical relation between a named entity. Expected in the future, it would help people to understand the ambiguity of the name or distinguish ambiguous names. The method used in this research was Robust Disambiguation because, in this method, the context of the named entity considered. The resulted output obtained was in the form of named entity that grouped based on the same person or different person processed with Density-based Spatial Clustering of Applications with Noise.  This research resulted in an accuracy value of 90%, a precision value of 97%, and a recall value of 89% obtained from actual value and predicted valu
    corecore